Models for Dichotomous Categorical Outcomes
The binomial distribution will help us understand how dichotomous variables are distributed. The binomial distribution arises in situations where:

- There is a fixed number of trials, \(n\).
- Each trial has exactly two possible outcomes, conventionally called "success" and "failure".
- The trials are independent of one another.
- The probability of success, \(p\), is the same on every trial.
A simple example of this type of process would be rolling five dice (\(n=5\)) and counting the number of sixes (\(p=0.167\)).
The binomial formula gives the probability that the random variable representing the number of successes \(X\) will be some specific number \(k\). This formula is:
\[P(X=k)={n \choose k}p^k(1-p)^{n-k}\]
This formula can be broken down into two parts.
When events are independent of one another, then the probability that they all occur is given by multiplying their individual probabilities together. Therefore, the probability of getting a certain number of successes and failures is given by just multiplying the probabilities of success (\(p\)) and failure (\(1-p\)) together.
For example, if we wanted to know the probability of getting 2 successes and 3 failures for the dice rolling example we would just take:
\[(1/6)(1/6)(5/6)(5/6)(5/6)=(1/6)^2(5/6)^3=0.016\] More generally, we can just say: \[p^k(1-p)^{n-k}\]
Does 1.6% seem like a low probability of rolling exactly two sixes in five dice rolls? That is because it is! This is only the probability of rolling one particular sequence of successes and failures. However, it's possible to arrange two successes and three failures in multiple ways. Here are all of them:
SSFFF, SFSFF, SFFSF, SFFFS, FSSFF, FSFSF, FSFFS, FFSSF, FFSFS, FFFSS
There are ten possible ways to combine two successes and three failures. Therefore, the total probability of rolling exactly two sixes in five dice rolls is:
\[10*(1/6)^2(5/6)^3=0.161\]
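We can check this arithmetic against R's built-in binomial probability function:

```r
# Probability of exactly two sixes in five rolls, computed two ways
p.manual <- choose(5, 2) * (1/6)^2 * (5/6)^3  # formula by hand
p.dbinom <- dbinom(2, size = 5, prob = 1/6)   # built-in binomial pmf
round(c(p.manual, p.dbinom), 3)               # both 0.161
```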
The \({n \choose k}\) formula provides a mathematical way to determine how many ways \(k\) successes can be distributed in \(n\) trials. The full formula is:
\[{n \choose k}=\frac{n!}{k!(n-k)!}\]
The exclamation points indicate a factorial which means you multiply that number by each integer lower than it in succession. For example:
\[4!=4*3*2*1\]
Typically, many of these values will cancel in the numerator and denominator, so calculation is not too hard for small \(n\) and \(k\). Let's take the example of two successes in five trials:
\[{5 \choose 2}=\frac{5!}{2!(5-2)!}=\frac{5*4*3*2}{2*3*2}=5*2=10\]
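R's factorial and choose functions confirm this count:

```r
factorial(5) / (factorial(2) * factorial(3))  # 120 / (2 * 6) = 10
choose(5, 2)                                  # same result: 10
```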
X <- 0:5
prob <- NULL
for(k in X) {
prob <- c(prob, choose(5,k)*(1/6)^k*(5/6)^(5-k))
}
par(mar=c(4,4,1,1))
barplot(prob, space=0, names.arg=X, las=1, col="darkgreen")
X <- 0:20
prob <- NULL
for(k in X) {
prob <- c(prob, choose(20,k)*(1/6)^k*(5/6)^(20-k))
}
par(mar=c(4,4,1,1))
barplot(prob, space=0, names.arg=X, las=1, col="darkgreen")
In an earlier slide, I said you can't put categorical variables on the left-hand side, but this isn't entirely true. R will estimate a model with a Boolean variable on the left-hand side. Let's try this out with survival on the left-hand side and fare paid on the right-hand side.
model.lpm <- lm((survival=="Survived")~fare, data=subset(titanic, !is.na(fare)))
summary(model.lpm)$coef
##                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 0.305551559 0.0154955468 19.718669 5.595816e-76
## fare        0.002296519 0.0002519465  9.115105 2.877047e-19
R estimates a model with a slope and intercept, but how do we interpret these numbers?
By treating survival as a 1 and death as a 0, we are implicitly modeling the probability of surviving as a function of fare:
\[\hat{p}_i=0.3056+0.0023(fare_i)\]
We would interpret the intercept and slope as follows:

- The intercept: a passenger who paid no fare has a predicted probability of survival of about 0.306 (30.6%).
- The slope: each additional pound of fare paid is associated with about a 0.23 percentage point increase in the predicted probability of survival.
This kind of model is called the linear probability model (LPM) and while seemingly useful, it has serious flaws:

- The errors are necessarily heteroscedastic, because the variance of a dichotomous outcome, \(p_i(1-p_i)\), depends on the predicted probability.
- Predicted probabilities can fall below zero or above one.
- The errors cannot be normally distributed, because for any given observation the residual can only take one of two values.
The first problem of heteroscedasticity can be fixed fairly easily by use of generalized least squares (GLS). Remember that GLS applies a weighting matrix to the regression model to account for autocorrelation or heteroscedasticity. In the case of heteroscedasticity, the solution is to use the inverse of the variance for each observation as a weight.
Since we know that the variance is given by \((p_i)(1-p_i)\), we can use the predicted values from an OLS regression model to derive weights. Because we are using an estimate to generate the weights, this method is called feasible generalized least squares (FGLS).
p.predict <- model.lpm$fitted.values
p.predict[p.predict>0.99] <- 0.99
w <- 1/(p.predict*(1-p.predict))
model.fgls <- update(model.lpm, weights=w)
cbind(coef(model.lpm), coef(model.fgls))
##                    [,1]        [,2]
## (Intercept) 0.305551559 0.324832554
## fare        0.002296519 0.001457149
I had to fudge predicted probabilities above 0.99 a bit to make sure the weight formula produced a real positive value.
FGLS can be improved by iterating it multiple times until the results stop varying. At this point, we have done iteratively reweighted least squares (IRLS) which should correct for heteroscedasticity. Let's try iterating 8 times to see if that stabilizes the result.
model.last <- model.lpm
coefs <- coef(model.lpm)
for(i in 1:8) {
p.predict <- model.last$fitted.values
p.predict[p.predict>=1] <- max(p.predict[p.predict<1])
w <- 1/(p.predict*(1-p.predict))
model.last <- update(model.last, weights=w)
coefs <- cbind(coefs,coef(model.last))
}
round(coefs, 4)
##              coefs                                                        
## (Intercept) 0.3056 0.3126 0.3071 0.3096 0.3082 0.3089 0.3085 0.3087 0.3086
## fare        0.0023 0.0019 0.0021 0.0020 0.0021 0.0020 0.0020 0.0020 0.0020
We reached convergence down to the fourth decimal place by the eighth iteration.
summary(model.last)$coef
##                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 0.308631881 0.0151178824 20.415021 1.251217e-80
## fare        0.002034007 0.0002160233  9.415681 2.046953e-20
We have now fixed the first problem of heteroscedasticity. However, it's still possible to get predicted values that fall outside the 0 to 1 range.
Note: I technically only “sort of” solved for heteroscedasticity in this model because I had to initially fudge my predicted values to get them in the range of 0 to 1 and it turns out the results are highly dependent on how that fudge is done.
What we need is a transformation that keeps predicted probabilities within their logical bounds. The logit transformation will do this for us. The logit is the log-odds of success. It can be calculated as:
\[logit(p)=\log(\frac{p}{1-p})\]
The part inside the log function is the odds of success, which is just the ratio of the probability of success to the probability of failure. We saw odds last term.
Probabilities must lie between 0 and 1, but odds lie between 0 and infinity. If we log the odds, then the resulting number will lie between negative infinity and infinity. So, the logit transformation allows us to stretch the probability out across the full number line.
For any real value of the logit \(g\), you can compute the probability by reversing this formula:
\[p=\frac{e^g}{1+e^g}\]
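R implements the logit and its inverse as `qlogis` and `plogis`. A quick sketch of both, alongside hand-rolled versions of the two formulas above:

```r
# qlogis() is R's logit; plogis() is the inverse logit
p <- 0.75
g <- qlogis(p)   # log(0.75/0.25) = log(3), about 1.099
plogis(g)        # back to 0.75
# hand-rolled equivalents of the formulas above
logit    <- function(p) log(p / (1 - p))
invlogit <- function(g) exp(g) / (1 + exp(g))
c(logit(p), invlogit(g))
```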
This transformation would seem to give us a solution. If we predict the logit of the probability of survival rather than the probability of survival, then all of the predicted values will be between zero and one when reverse-transformed.
\[\log(\frac{p_i}{1-p_i})=\beta_0+\beta_1(fare_i)\]
We can think about the OLS regression model as a partition of the variance in a set of observed values for a dependent variable into a component that is accounted for by some function of independent variables and a random part that is not accounted for by the independent variables:
\[observed = structural + stochastic\]
So, we can write this conceptual equation more formally as:
\[y_i=\hat{y}_i+\epsilon_i\]
The term \(\hat{y}_i\) in the equation above is the "model" part of our equation because it predicts \(y\) by \(x\) through some functional form.
What is this functional form? Typically, we assume a linear relationship such that:
\[\hat{y}_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\] Substituting this equation for \(\hat{y}_i\) above gives us the full OLS regression model formula:
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}+\epsilon_i\]
Depending on disciplinary norms, there are different conceptual ways to view the basic relationship of:
\[observed = structural+stochastic\]
Mathematically, these are all identical. The baggage about what the math means is all in your head.
To better understand generalized linear models (GLM), we can first restate our OLS regression model using the language of the GLM. In order to do that I want to return to this particular set up of the OLS regression model:
\[y_i=\hat{y}_i+\epsilon_i\] \[\hat{y_i}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\] There are two key concepts to understand here:

- The structural component: the predicted value \(\hat{y}_i\) is a deterministic linear function of an observation's values on the independent variables.
- The stochastic component: the residual \(\epsilon_i\) is a random draw from some distribution.
This formulation is often thought of as a description of a data-generating process. In this case, the data-generating process is simple to the point of crudeness: you get a predicted outcome (\(\hat{y}_i\)) from a linear combination of an observation’s values on the independent variables, and then you reach into a distribution (the same distribution each time because iid) and pull out a value that represents all the random stuff that you tag on the end.
We can write the distribution of \(\epsilon_i\) as:
\[\epsilon_i\sim N(0,\sigma)\]
We can then rethink the data-generation process for \(y_i\) as reaching into a distribution:
\[y_i \sim N(\hat{y}_i,\sigma)\]
To get the value of \(y_i\) for any observation we reach into a normal distribution that is centered on the predicted value, but that always has the same standard deviation. The predicted value is given by a linear combination of the independent variables:
\[\hat{y_i}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\]
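This data-generating process is easy to simulate. The sketch below uses made-up values (\(\beta_0=2\), \(\beta_1=1.5\), \(\sigma=3\)) to draw each \(y_i\) from a normal distribution centered on its predicted value; refitting with `lm` should roughly recover those values:

```r
# Simulate the data-generating process y_i ~ N(yhat_i, sigma)
# (coefficients and sigma here are made up for illustration)
set.seed(42)
n <- 1000
x <- runif(n, 0, 10)
yhat <- 2 + 1.5 * x                 # structural (linear) part
y <- rnorm(n, mean = yhat, sd = 3)  # stochastic draw around yhat
coef(lm(y ~ x))                     # should be close to 2 and 1.5
```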
Consider this reformulation of the OLS regression model:
\[y_i \sim N(\hat{y}_i,\sigma)\] \[\hat{y_i}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\]
These two equations identify the two key components of a GLM:

- The error distribution: the probability distribution assumed to generate the outcome (here, the normal distribution).
- The linear predictor: the linear function of the independent variables, connected to the expected value of the outcome through a link function (here, the identity link).
The glm command with the family option will allow you to run a model and specify the error distribution and link function.
summary(lm(Violent~MedianAge, data=crimes))$coef
##               Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) 1343.9360  434.20828  3.095141 0.003248259
## MedianAge    -25.5795   11.55514 -2.213690 0.031535942
summary(glm(Violent~MedianAge, data=crimes, family=gaussian(link=identity)))$coef
##               Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) 1343.9360  434.20828  3.095141 0.003248259
## MedianAge    -25.5795   11.55514 -2.213690 0.031535942
The results are identical.
Under the hood, the glm command uses a different technique called maximum likelihood estimation (MLE) to estimate coefficients. We will discuss how this works in the next section. Because the estimation technique differs, the summary statistics reported will also vary. The glm command, for example, will not give you \(R^2\).

Returning to the Titanic data, the survival variable is just a set of ones and zeros that can be thought of as being produced by a binomial distribution like so:
\[y_i \sim binom(1, p_i)\] We now have a binomial error distribution. The key parameter in this distribution is \(p_i\), the underlying probability of survival for each passenger. We can now transform this with a logit link, so that the independent variables are linearly related to the log of the odds of survival:
\[logit(p_i)=\log(\frac{p_i}{1-p_i})=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\]
This formulation is often called the logistic regression model.
model.glm <- glm((survival=="Survived")~fare, data=titanic,
family=binomial(link=logit))
summary(model.glm)$coef
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -0.88402354 0.07550922 -11.707491 1.166809e-31
## fare         0.01247006 0.00160488   7.770089 7.843126e-15
It works! … Or at least it did something. Interpreting the responses is the tricky part because we have to take account of the transformation used here. However, before we discuss interpretation, we need to discuss in more detail how R generated this estimate using maximum likelihood estimation (MLE).
Let's start with a simpler case than the glm. Say we flip a coin \(n\) times and observe \(x\) heads. The values of \(n\) and \(x\) are the data. We know that this data-generating process is governed by the binomial distribution where the key parameter is \(p\), the probability of a heads on each trial.
Because we know that the data-generating process is a binomial distribution, we also know that the likelihood function is:
\[L(p)={n \choose x}p^x(1-p)^{n-x}\]
Note that this formula is identical to the binomial formula except that it is now a function of \(p\) (which is unknown) rather than \(n\) and \(x\) (which are known).
Let's take an example where \(x=13\) and \(n=50\). The likelihood function then becomes:
\[L(p)={50 \choose 13}p^{13}(1-p)^{37}\] We can plot this function by just choosing values of \(p\) at intervals of 0.01 from 0 to 1.
We can calculate a general result for the MLE of \(p\) from a binomial process by finding the maximum of the log-likelihood function: \[ \begin{aligned} \log L(p)&=\log({n \choose x}p^x(1-p)^{n-x})\\ \log L(p)&=\log {n \choose x} + x \log(p) + (n-x)\log(1-p)\\ \end{aligned} \] The reason we like to work with the log-likelihood is that it converts multiplicative relationships into additive relationships. In order to find the maximum of this function, we first need to find the derivative of the log-likelihood function with respect to \(p\):
\[\frac{\partial \log L(p)}{\partial p}=\frac{x}{p}-\frac{n-x}{1-p}\]
Now, we can solve for the value of p when this equation is zero to get the maximum.
Solve for p when function equals zero:
\[ \begin{aligned} 0&=\frac{x}{p}-\frac{n-x}{1-p}\\ \frac{n-x}{1-p}&=\frac{x}{p}\\ p(n-x)&=x(1-p)\\ pn-px&=x-px\\ pn&=x\\ p&=x/n \end{aligned} \]
Congratulations, we have proved the obvious! The most likely value of \(p\) with \(x\) heads in \(n\) trials is \(x/n\), the proportion of heads. Technically, we also need to show that the second derivative here is negative to show it's a maximum and not a minimum, but we won't do that here.
This case has a closed form solution. In practice, many cases (including the GLM) do not have closed form solutions for the MLE and iterative techniques have to be used instead.
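We can confirm the closed-form result numerically. Rather than implementing a full iterative algorithm, the sketch below simply maximizes the log-likelihood for the \(x=13\), \(n=50\) example with R's general-purpose `optimize` function:

```r
# Numerically maximize the binomial log-likelihood for x = 13, n = 50
loglik <- function(p) dbinom(13, size = 50, prob = p, log = TRUE)
mle <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
c(numerical = mle$maximum, closed.form = 13/50)  # both about 0.26
```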
We can write the predicted log-odds of an observation by vector multiplication as \(\mathbf{x_i'\beta}\). This log odds can be converted into a probability as:
\[\hat{p}_i=F(\mathbf{x_i'\beta})=\frac{e^{\mathbf{x_i'\beta}}}{1+e^{\mathbf{x_i'\beta}}}\]
The likelihood of a particular observation \(i\) being equal to \(y_i\) (1 or 0) is given by the bernoulli distribution:
\[L_i=F(\mathbf{x_i'\beta})^{y_i}(1-F(\mathbf{x_i'\beta}))^{1-y_i}\] The log-likelihood is equal to:
\[\log L_i=y_i\log F(\mathbf{x_i'\beta})+(1-y_i) \log (1-F(\mathbf{x_i'\beta}))\]
The log-likelihood for all the observations is just the sum of these individual log-likelihoods:
\[\log L= \sum_{i=1}^n \log L_i= \sum_{i=1}^n y_i\log F(\mathbf{x_i'\beta})+(1-y_i) \log (1-F(\mathbf{x_i'\beta}))\]
The technique that glm uses by default to maximize this log-likelihood is a form of iteratively reweighted least squares (IRLS). The IRLS technique calculates the next iteration (\(t+1\)) of \(\beta\) values from the current iteration (\(t\)) as:
\[\beta^{(t+1)}=\mathbf{(X'W^{(t)}X)^{-1}X'W^{(t)}z^{(t)}}\]
Like WLS, this technique uses a weighting matrix \(\mathbf{W}\). This weighting matrix only has elements along the diagonal, which are equal to:
\[w_i=\hat{p}_i(1-\hat{p}_i)\] where \(\hat{p}_i\) is the estimated probability from iteration \(t\).
The \(\mathbf{z}\) vector is a transformed vector of the dependent variable where each element is given by:
\[z_i=\mathbf{x_i'\beta}+\frac{y_i-\hat{p}_i}{\hat{p}_i(1-\hat{p}_i)}\]
Let's try this estimation procedure out by hand on the Titanic model predicting survival by fare.
Let's begin by setting up the X matrix and y values. Note that I have to get rid of missing values here.
X <- na.omit(as.matrix(cbind(rep(1,nrow(titanic)), titanic[,"fare"])))
y <- as.vector(as.numeric(titanic[!is.na(titanic$fare),"survival"]=="Survived"))
I calculate the average log odds of survival by the survival proportion and use this in a model.
lodds <- log(mean(y)/(1-mean(y)))
beta <- c(lodds, 0)
eta <- X%*%beta
p <- exp(eta)/(1+exp(eta))
The \(p\) vector repeats the same probability. I can use this to calculate the log-likelihood and null deviance:
logL <- sum(y*log(p)+(1-y)*log(1-p))
deviance.null <- -2*logL
deviance.null
## [1] 1741.024
Starting from this null model, I can now begin to iterate my results using the adjustments described in the previous slides. I will save each iteration of beta values in an object called beta.prev and I will also track my log-likelihood over iterations.
beta.prev <- beta
for(i in 1:5) {
w <- p*(1-p)
z <- eta + (y-p)/w
W <- matrix(0, nrow(X), nrow(X))
diag(W) <- w
beta <- solve(t(X)%*%W%*%X)%*%(t(X)%*%W%*%z)
beta.prev <- cbind(beta.prev, beta)
eta <- X%*%beta
p <- exp(eta)/(1+exp(eta))
logL <- c(logL, sum(y*log(p)+(1-y)*log(1-p)))
}
deviance <- -2*logL
round(beta.prev,5)
##      beta.prev                                            
## [1,]  -0.48119 -0.80491 -0.87722 -0.88395 -0.88402 -0.88402
## [2,]   0.00000  0.00973  0.01219  0.01247  0.01247  0.01247
logL
## [1] -870.5122 -828.9650 -827.4084 -827.3924 -827.3924 -827.3924
deviance
## [1] 1741.024 1657.930 1654.817 1654.785 1654.785 1654.785
Of course, the glm command will do all of this for you. For comparison, here are the final hand-computed estimates:

beta.prev[,6]
## [1] -0.88402354 0.01247006
model.survival <- glm((survival=="Survived")~fare, data=titanic,
family=binomial(logit),
control=glm.control(trace=TRUE))
## Deviance = 1658.387 Iterations - 1
## Deviance = 1654.791 Iterations - 2
## Deviance = 1654.785 Iterations - 3
## Deviance = 1654.785 Iterations - 4
summary(model.survival)$coef
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -0.88402354 0.07550922 -11.707491 1.166809e-31
## fare         0.01247006 0.00160488   7.770089 7.843126e-15
The logistic regression model is used to predict dichotomous outcomes. It is a specific kind of GLM with a binomial error distribution and a logit link. Formally:
\[y_i \sim binom(1, \hat{p}_i)\]
\[log(\frac{\hat{p}_i}{1-\hat{p}_i}) = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_px_{ip}\]
Any probability \(p\) can be converted into an odds \(O\):
\[O=\frac{p}{1-p}\] The odds can also be converted back to a probability:
\[p=\frac{O}{1+O}\]
Two odds \(O_1\) and \(O_2\) can be directly compared by taking their ratio:
\[OR=O_1/O_2\]
Let's say we had an odds ratio of two between two groups and the group with the lower odds had even odds of 1. What are the two underlying probabilities?
\[p_1=\frac{O_1}{1+O_1}=\frac{1}{1+1}=\frac{1}{2}=0.5\] \[p_2=\frac{O_2}{1+O_2}=\frac{2}{1+2}=\frac{2}{3}=0.667\] In this case, a doubling of odds means an increase in probability from 50% to 67%. What if the lower odds was 9?
\[p_1=\frac{O_1}{1+O_1}=\frac{9}{9+1}=\frac{9}{10}=0.9\] \[p_2=\frac{O_2}{1+O_2}=\frac{18}{1+18}=\frac{18}{19}=0.947\] A doubling of the odds here necessarily leads to a smaller increase in probability because we are closer to one. The odds ratio separates the relationship from the level of success. This is a feature, not a bug.
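These conversions are simple enough to wrap in a small helper function, which we can use to verify the worked examples above:

```r
# Convert odds to probabilities using the formula above
odds_to_prob <- function(o) o / (1 + o)
round(odds_to_prob(c(1, 2)), 3)   # 0.500 0.667: doubling odds from 1
round(odds_to_prob(c(9, 18)), 3)  # 0.900 0.947: doubling odds from 9
```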
table(titanic$sex,titanic$survival)
##         
##          Survived Died
##   Female      339  127
##   Male        161  682
We can calculate men’s survival odds and odds ratio of survival for women compared to men:
161/682
## [1] 0.2360704
339*682/(161*127)
## [1] 11.30718
About 0.24 men survived for every man who died. The odds of survival for women are 11.3 times higher.
Let's use the logistic regression model to predict survival by sex:
summary(glm((survival=="Survived")~(sex=="Female"),
data=titanic, family=binomial(logit)))$coef
##                      Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)         -1.443625 0.08762108 -16.47578 5.478130e-61
## sex == "Female"TRUE  2.425438 0.13601950  17.83155 4.021316e-71
So we have the following model:
\[log(\frac{\hat{p}_i}{1-\hat{p}_i}) = -1.444+2.425(female_i)\] How do we interpret the slope and intercept?
The dependent variable is literally measured in the log-odds of survival, but this is not very intuitive. We can convert directly to odds by exponentiating both sides:
\[e^{log(\frac{\hat{p}_i}{1-\hat{p}_i})} = e^{-1.444+2.425(female_i)}\] \[\frac{\hat{p}_i}{1-\hat{p}_i} = (e^{-1.444})(e^{2.425(female_i)})\] \[\frac{\hat{p}_i}{1-\hat{p}_i} = (0.24)(11.3)^{(female_i)}\] We now have a multiplicative model that describes the relationship between survival and sex.
\[\frac{\hat{p}_i}{1-\hat{p}_i} = (0.24)(11.3)^{(female_i)}\]
What are the odds of survival for men?
\[\frac{\hat{p}_i}{1-\hat{p}_i} = (0.24)(11.3)^{0}=(0.24)(1)=0.24\] What are the odds of survival for women?
\[\frac{\hat{p}_i}{1-\hat{p}_i} = (0.24)(11.3)^{1}=(0.24)(11.3)\]
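Because sex is the only predictor, the model simply reproduces the odds in the two-way table. We can check the model-implied quantities directly from the table counts shown earlier:

```r
# Odds of survival computed straight from the two-way table above
odds.male   <- 161/682              # about 0.236
odds.female <- 339/127              # about 2.669
odds.female / odds.male             # odds ratio, about 11.3
odds.female / (1 + odds.female)     # women's survival probability, 339/466
```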
I use the relevel function here to reset the reference category in order to simplify my model formulas. Then I run a bivariate model by gender, a model that controls for passenger class, and a model that interacts gender and passenger class.
titanic$survival <- relevel(titanic$survival, "Died")
titanic$sex <- relevel(titanic$sex, "Male")
model.gender <- glm(survival~sex, data=titanic, family=binomial(logit))
model.genderclass <- update(model.gender, .~.+pclass)
model.interact <- update(model.genderclass, .~.+sex*pclass)
Dependent variable: survival

|  | (1) | (2) | (3) |
|---|---|---|---|
| Female | 2.43 | 2.52 | 3.98 |
| 2nd Class |  | -0.88 | -1.10 |
| 3rd Class |  | -1.72 | -1.06 |
| Female x 2nd Class |  |  | -0.16 |
| Female x 3rd Class |  |  | -2.30 |
| Constant | -1.44 | -0.41 | -0.66 |
| Observations | 1,309 | 1,309 | 1,309 |
| Log Likelihood | -684.05 | -628.61 | -605.02 |

Note: *p<0.1; **p<0.05; ***p<0.01
Let's return to the example predicting survival by fare paid:
model.fare <- glm(survival~fare, data=titanic, family=binomial(logit))
summary(model.fare)$coef
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -0.88402354 0.07550922 -11.707491 1.166809e-31
## fare         0.01247006 0.00160488   7.770089 7.843126e-15
\[log(\frac{\hat{p}_i}{1-\hat{p}_i}) = -0.882+0.012(fare_i)\] How do we interpret the slope and intercept in this case?
Let's exponentiate both sides again:
\[\frac{\hat{p}_i}{1-\hat{p}_i} = (e^{-0.882})(e^{0.012(fare_i)})\] \[\frac{\hat{p}_i}{1-\hat{p}_i} = (0.414)(1.012)^{fare_i}\]
How do the odds of survival compare for someone who paid no fare, one pound, and two pounds?
\[ \begin{aligned} \frac{\hat{p}_i}{1-\hat{p}_i}& = (0.414)(1.012)^{0}=(0.414)\\ & = (0.414)(1.012)^{1}=(0.414)(1.012)\\ & = (0.414)(1.012)^{2}=(0.414)(1.012)(1.012)\\ \end{aligned} \] Each one pound increase in fare is associated with a 1.2% increase in the odds of survival.
Let's use the model to predict odds and probabilities for the entire range of fare:
fare <- 0:512
odds <- 0.414*(1.012)^fare
prob <- odds/(1+odds)
model.full <- glm(smoker~grade+sex*honorsociety+alcoholuse+sex,
                  data=addhealth, family=binomial)
round(summary(model.full)$coef, 3)
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                 -4.142      0.294 -14.096    0.000
## grade                        0.215      0.029   7.360    0.000
## sex Male                    -0.071      0.094  -0.756    0.450
## honorsocietyTRUE            -1.125      0.281  -3.997    0.000
## alcoholuseDrinker            1.590      0.098  16.248    0.000
## sex Male:honorsocietyTRUE    0.580      0.392   1.478    0.139
round(exp(model.full$coef),3)
##               (Intercept)                     grade 
##                     0.016                     1.240 
##                  sex Male          honorsocietyTRUE 
##                     0.932                     0.325 
##         alcoholuseDrinker sex Male:honorsocietyTRUE 
##                     4.902                     1.786
For a given vector of values of the independent variables given by \(\mathbf{x}\) and a vector of regression coefficients \(\mathbf{\beta}\), marginal effects can be estimated by first calculating the expected probability \(\hat{p}\):
\[\hat{p}=\frac{e^\mathbf{x'\beta}}{1+e^{\mathbf{x'\beta}}}\] The marginal effect of the \(k\)th independent variable in the model is then given by:
\[\hat{p}(1-\hat{p})\beta_k\]
coef(glm(survival~fare+age, data=titanic, family=binomial(logit)))
## (Intercept)        fare         age 
## -0.51253566  0.01350664 -0.01366064
apply(titanic[,c("fare","age")], 2, mean, na.rm=TRUE)
##     fare      age 
## 33.27619 29.73848
p <- exp(-0.5125+0.0135*33.28-0.0137*29.74)/(1+exp(-0.5125+0.0135*33.28-0.0137*29.74))
round(p*(1-p)*c(0.0135,-0.0137), 4)

## [1]  0.0032 -0.0032

The marginal effect of a one pound increase in fare when fare and age are at their means is about a 0.3 percentage point increase in the probability of survival. The marginal effect of a one year increase in age at those same means is about a 0.3 percentage point decrease in the probability of survival.
The marginal effects for categorical variables are different. Typically you will estimate the difference in probability of survival across the categories when at the mean on all other variables.
coef(glm(survival~fare+age+sex, data=titanic, family=binomial(logit)))
##  (Intercept)         fare          age    sexFemale 
## -1.466095031  0.010026146 -0.009064943  2.325909911
p1 <- exp(-1.466+0.01*33.28-0.0091*29.74)/(1+exp(-1.466+0.01*33.28-0.0091*29.74))
p2 <- exp(-1.466+0.01*33.28-0.0091*29.74+2.326)/(1+exp(-1.466+0.01*33.28-0.0091*29.74+2.326))
round(p2-p1, 3)

## [1] 0.518

When at the mean of age and fare, women's probability of survival is about 52 percentage points higher than men's.
Marginal effects are easy to calculate with the mfx package:
library(mfx)
logitmfx(survival~fare+age+sex, data=titanic)
## Call:
## logitmfx(formula = survival ~ fare + age + sex, data = titanic)
## 
## Marginal Effects:
##                 dF/dx  Std. Err.       z     P>|z|    
## fare       0.00231089 0.00039964  5.7824 7.365e-09 ***
## age       -0.00208935 0.00115590 -1.8075   0.07068 .  
## sexFemale  0.51833735 0.02577096 20.1132 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "sexFemale"
You can estimate a probit model with the glm command by using binomial(probit) as the family argument rather than binomial(logit). A probability of 0.84 would correspond to a probit score of one. This is the point on the standard normal distribution (mean=0, sd=1) where 84% of the area is to the left.
There is no \(R^2\) value for a generalized linear model. A model predicting categorical outcomes does not have residuals in the same sense as the OLS regression model.
Most measures of model fit for GLMs rely upon the deviance of the model, or \(G^2\). The deviance is simply the log-likelihood of the model multiplied by negative 2:
\[G^2=-2(logL)\]
The smaller the deviance, the better the fit of the model, because the likelihood is getting higher. Typically we are concerned with the deviance of three conceptual models:

- The null model, which includes only an intercept (deviance \(G^2_0\)).
- The current (fitted) model that we are evaluating (deviance \(G^2\)).
- The saturated model, which fits the data perfectly and has a deviance of zero.
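Because the deviance is just \(-2\) times the log-likelihood, we can verify this relationship on any fitted binomial glm. The sketch below uses R's built-in mtcars data for a self-contained example (not the course data):

```r
# For a binomial glm fit to 0/1 data, residual deviance = -2 * logLik
# (mtcars ships with R; am is a 0/1 variable)
m <- glm(am ~ mpg, data = mtcars, family = binomial)
all.equal(-2 * as.numeric(logLik(m)), m$deviance)  # TRUE
```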
One approach is to calculate a measure that is similar to \(R^2\) which measures the proportion of deviance from the null model that is accounted for in the current model. This measure is called McFadden’s D.
\[D=\frac{G^2_0-G^2}{G^2_0}\]
The number on top is the reduction in deviance from the null to the current model. This number is then divided by the null deviance.
Just like \(R^2\), this measure takes no account of parsimony. More complex models will always have a larger \(D\) value.
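Since glm objects store both the null and residual deviance, McFadden's D is a one-line function. For a self-contained illustration, the example below fits a model to R's built-in mtcars data rather than the course data:

```r
# McFadden's D from the stored null and residual deviances
mcfadden.D <- function(model) {
  (model$null.deviance - model$deviance) / model$null.deviance
}
# illustrative fit on R's built-in mtcars data (not the course data)
m <- glm(am ~ mpg, data = mtcars, family = binomial)
round(mcfadden.D(m), 3)
```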
The Likelihood Ratio Test (LRT) is the analog to the F-test for OLS regression models. Any two nested models can be compared using the LRT. The test statistic for the LRT is the reduction in deviance in the more complex model:
\[G^2_2-G^2_1\]
Assuming the null hypothesis that all of the additional terms in the second model have no predictive power (all \(\beta=0\)), this observed difference should come from a \(\chi^2\) (chi-squared) distribution with degrees of freedom equal to the number of additional terms in the second model.
model.gender <- glm(survival~sex, data=titanic, family=binomial)
model.gender$null.deviance-model.gender$deviance
## [1] 372.9213
A model predicting survival by gender alone reduced the deviance by 373 compared to the null model which assumed equal probability of survival for every passenger. What is the p-value for this?
1-pchisq(373,1)
## [1] 0
So close to zero that R is reporting zero.
We can also use the anova command to do an LRT if we add in the additional argument test="LRT":
anova(model.gender, test="LRT")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: survival
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                  1308     1741.0              
## sex   1   372.92      1307     1368.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.complex <- update(model.gender, .~.+pclass)
anova(model.gender, model.complex, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: survival ~ sex
## Model 2: survival ~ sex + pclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1307     1368.1                          
## 2      1305     1257.2  2   110.88 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can also calculate BIC for generalized linear models. The general formula to compare any two GLMs by BIC is:
\[BIC=(G^2_2-G^2_1)+(p_2-p_1)\log n\] The first part is the difference in deviance between the two models, which measures goodness of fit, and the second part is the parsimony penalty. If this BIC statistic is negative, model 2 is preferred to model 1. If it is positive, model 1 is preferred to model 2.
If you are comparing the current model to the null model, then BIC becomes:
\[BIC'=(G^2-G^2_0)+(p)\log n\]
You can use this value to compare any two models since the difference in BIC’ between any two models is equivalent to the BIC between them.
Let's write a function for BIC' compared to the null model:
BIC.null.glm <- function(model) {
n <- length(model$resid)
p <- length(model$coef)-1
return((model$deviance-model$null.deviance)+p*log(n))
}
BIC.null.glm(model.complex)-BIC.null.glm(model.gender)
## [1] -96.52692
(model.complex$deviance-model.gender$deviance)+(3-1)*log(nrow(titanic))
## [1] -96.52692
Both approaches lead to the same BIC difference. We prefer the more complex model.
Dependent variable: Uses Alcohol

|  | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| grade | 0.36*** | 0.37*** | 0.38*** | 0.38*** | 0.38*** | 0.37*** |
|  | (0.03) | (0.03) | (0.03) | (0.03) | (0.03) | (0.03) |
| male |  | 0.33*** | 0.26*** | 0.16* | 0.18** | 0.19 |
|  |  | (0.08) | (0.09) | (0.09) | (0.09) | (0.12) |
| gpa |  | -0.34*** | -0.35*** | -0.39*** | -0.35*** | -0.41*** |
|  |  | (0.06) | (0.06) | (0.06) | (0.06) | (0.06) |
| honor society |  |  | 0.02 | 0.01 | -0.08 | 0.01 |
|  |  |  | (0.16) | (0.16) | (0.16) | (0.16) |
| band or choir |  |  | -0.34*** | -0.37*** | -0.34*** | -0.41*** |
|  |  |  | (0.11) | (0.11) | (0.11) | (0.11) |
| number of sports |  |  |  | 0.12*** | 0.09** | 0.13** |
|  |  |  |  | (0.03) | (0.04) | (0.05) |
| popularity |  |  |  |  | 0.07*** | 0.08*** |
|  |  |  |  |  | (0.01) | (0.01) |
| male x popularity |  |  |  |  |  | -0.02 |
|  |  |  |  |  |  | (0.07) |
| Constant | -5.29*** | -4.41*** | -4.50*** | -4.72*** | -4.52*** | -4.38*** |
|  | (0.28) | (0.32) | (0.33) | (0.34) | (0.34) | (0.32) |
| Deviance | 3660.46 | 3591.31 | 3571.67 | 3531.17 | 3571.54 | 3543.69 |
| pseudo-R2 | 0.05 | 0.07 | 0.08 | 0.09 | 0.08 | 0.08 |
| BIC' | -185.47 | -204 | -206.92 | -239.06 | -198.69 | -251.62 |
| Observations | 4,317 | 4,286 | 4,286 | 4,286 | 4,286 | 4,286 |

Note: *p<0.1; **p<0.05; ***p<0.01
library(BMA)
model.bma <- bic.glm(addhealth[,c("grade","sex","pseudoGPA","honorsociety",
"bandchoir","academicclub","nsports","indegree")],
addhealth$alcoholuse, glm.family="binomial")
summary(model.bma)
## 9 models were selected
## Best 5 models (cumulative posterior probability = 0.8598):
## 
##               p!=0    EV       SD      model 1    model 2    model 3    model 4    model 5
## Intercept     100   -4.48889  0.33973 -4.380e+00 -4.562e+00 -4.571e+00 -4.357e+00 -4.557e+00
## grade         100.0  0.37028  0.02898  3.654e-01  3.793e-01  3.668e-01  3.643e-01  3.658e-01
## sex            27.9  0.06456  0.11425  .          .          2.267e-01  .          2.396e-01
## pseudoGPA     100.0 -0.41212  0.06107 -4.064e-01 -4.212e-01 -3.868e-01 -4.270e-01 -4.078e-01
## honorsociety    0.0  0.00000  0.00000  .          .          .          .          .
## bandchoir      93.9 -0.38010  0.14804 -4.072e-01 -4.119e-01 -3.663e-01 -4.319e-01 -3.907e-01
## academicclub   24.3  0.06769  0.13120  .          .          .          2.797e-01  2.967e-01
## nsports        31.9  0.03082  0.04925  .          9.941e-02  .          .          .
## indegree      100.0  0.07441  0.01129  7.633e-02  7.129e-02  7.651e-02  7.523e-02  7.531e-02
## 
## nVar                                   4          5          5          5          6
## BIC                                   -3.228e+04 -3.228e+04 -3.228e+04 -3.228e+04 -3.228e+04
## post prob                              0.295      0.240      0.131      0.118      0.077
I have taken a subset of 100 observations of the Add Health data to show you an example of separation. For this subset, let's look at the two-way table between smoking and band/choir participation.
table(addhealth.samp$bandchoir, addhealth.samp$smoker)
##        
##         FALSE TRUE
##   FALSE    62   11
##   TRUE     25    0
In this sample, none of the band/choir participants were smokers, leading to a zero cell. We can’t estimate the odds ratio of this table and the logistic regression model will suffer from separation:
summary(glm(smoker~bandchoir, data=addhealth.samp, family=binomial))$coef
##                 Estimate   Std. Error      z value     Pr(>|z|)
## (Intercept)    -1.729239    0.3271668 -5.285496714 1.253641e-07
## bandchoirTRUE -17.836829 2150.8026192 -0.008293104 9.933831e-01
The very large coefficient and SE are clear signs of separation.
In this case, the band/choir variable is dichotomous, so it is not possible to collapse categories. This model can be estimated with adjustments for separation using a penalized likelihood model. These types of models apply a penalty to the maximum likelihood estimation for very large coefficients. They have multiple uses but were originally developed to deal precisely with the problem of separation.
The logistf package provides a penalized likelihood logistic model in R:
library(logistf)
logistf(smoker~bandchoir, data=addhealth.samp)
## logistf(formula = smoker ~ bandchoir, data = addhealth.samp)
## Model fitted by Penalized ML
## Confidence intervals and p-values by Profile Likelihood
## 
##                    coef  se(coef) lower 0.95 upper 0.95    Chisq            p
## (Intercept)   -1.692820 0.3230586   -2.37358 -1.1037637 38.65395 5.060075e-10
## bandchoirTRUE -2.239006 1.4916940   -7.10740 -0.1348222  4.51160 3.366574e-02
## 
## Likelihood ratio test=4.5116 on 1 df, p=0.03366574, n=98